-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support remapping for IVF_FLAT, IVF_PQ and IVF_SQ #2708
Conversation
prepare for supporting remap for new vector index format, HNSW remap not supported because simply mapping the row ids could break the connectivity of graph Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2708 +/- ##
==========================================
+ Coverage 78.80% 79.00% +0.19%
==========================================
Files 246 246
Lines 86637 86900 +263
Branches 86637 86900 +263
==========================================
+ Hits 68278 68655 +377
+ Misses 15529 15378 -151
- Partials 2830 2867 +37
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@BubbleCal I've marked this as draft, since I'm assuming it is not ready for review. (There are no unit tests.) Mark it as ready for review when it is ready. |
…ctor-index Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
lance_io::ReadBatchParams::RangeFull, | ||
4096, | ||
16, | ||
projection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need the part_id
in batch, just don't read it to save resources
Signed-off-by: BubbleCal <[email protected]>
@@ -134,6 +135,10 @@ impl IvfSubIndex for FlatIndex { | |||
Ok(Self {}) | |||
} | |||
|
|||
fn remap(&self, _: &HashMap<u64, Option<u64>>) -> Result<Self> { | |||
Ok(self.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's add a warning log here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh w8, we should remap sub index here no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for v3 we need to remap the subindex & vector storage. flat index doesn't contain anything so it simply returns itself here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flat map is still { origin_vector: row_id }
? if row id
changes during compaction, we need to remap them ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remap on an vector index (v3) is:
- remap the sub index
- remap the storage
for IVF_FLAT, the sub index isFLAT
and storage isFlatStorage
. FLAT sub index doesn't contain any data so no need to do anything here. the remapping happens onFlatStorage
let batch = concat_batches(self.schema(), batches.iter())?; | ||
Self::try_from_batch(batch, self.distance_type()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess remap is already slow so it probably doesn't matter but it seems odd we would need to concat here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it's because try_from_batch
is not trivial, e.g. for PQ storage, it would transpose the pq codes
let element_type = get_vector_element_type(dataset, column)?; | ||
match element_type { | ||
DataType::Float16 | DataType::Float32 | DataType::Float64 => { | ||
IvfIndexBuilder::<FlatIndex, FlatQuantizer>::new( | ||
dataset.clone(), | ||
column.to_owned(), | ||
dataset.indices_dir().child(uuid), | ||
params.metric_type, | ||
Box::new(shuffler), | ||
Some(ivf_params.clone()), | ||
Some(()), | ||
(), | ||
)? | ||
.build() | ||
.await?; | ||
} | ||
DataType::UInt8 => { | ||
IvfIndexBuilder::<FlatIndex, FlatBinQuantizer>::new( | ||
dataset.clone(), | ||
column.to_owned(), | ||
dataset.indices_dir().child(uuid), | ||
params.metric_type, | ||
Box::new(shuffler), | ||
Some(ivf_params.clone()), | ||
Some(()), | ||
(), | ||
)? | ||
.build() | ||
.await?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed there are many lines are doing the same thing: get the vector data type / value type and check it.
so just made the function get_vector_element_type
to do this
// async fn append(&self, batches: Vec<RecordBatch>) -> Result<()> { | ||
// IvfIndexBuilder::new( | ||
// dataset, | ||
// column, | ||
// index_dir, | ||
// distance_type, | ||
// shuffler, | ||
// ivf_params, | ||
// sub_index_params, | ||
// quantizer_params, | ||
// ) | ||
// } | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah these lines are commented and not used, so just removed them
async fn write_batches( | ||
path: Path, | ||
batches: impl Iterator<Item = RecordBatch>, | ||
schema: Schema, | ||
) -> Result<usize> { | ||
let object_store = ObjectStore::local(); | ||
let writer = object_store.create(&path).await?; | ||
let mut writer = FileWriter::try_new(writer, schema, Default::default())?; | ||
for batch in batches { | ||
writer.write_batch(&batch).await?; | ||
} | ||
Ok(writer.finish().await? as usize) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't have to be part of this PR but it might be nice to have this as a static method on FileWriter
.
rust/lance/src/index/vector/ivf.rs
Outdated
) -> Result<()> { | ||
let index_dir = dataset.indices_dir().child(new_uuid); | ||
let element_type = get_vector_element_type(dataset, &column)?; | ||
match index.index_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add a remap
method to the VectorIndex
trait instead of using a match statement here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah just tried, it can work!
} | ||
} | ||
|
||
async fn test_remap_impl<T: ArrowPrimitiveType>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this only test the case where rows are deleted or does it also test the case where fragments are combined and row ids are changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
.open_vector_index(q.column.as_str(), &index.uuid.to_string()) | ||
.await?; | ||
let mut q = q.clone(); | ||
q.metric_type = idx.metric_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fixes a bug that with unindexed data, the flat search may compute the distances in a different distance type
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cargo bump triggered a substrait update which is causing the MSRV failure. I'll make a PR to bump our MSRV (probably the easiest fix and 1.80 has been out for six months). No strong opinion on whether you wait for that PR or just merge and break CI.
not support IVF_HNSW_* index yet